<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Chapter 2. Parsers</title>
<meta name="generator" content="DocBook XSL Stylesheets V1.78.1">
<link rel="home" href="index.html" title="libxml++ - An XML Parser for C++">
<link rel="up" href="index.html" title="libxml++ - An XML Parser for C++">
<link rel="prev" href="ch01s03.html" title="Compilation and Linking">
<link rel="next" href="ch02s02.html" title="SAX Parser">
</head>
<body bgcolor="white" text="black" link="#0000FF" vlink="#840084" alink="#0000FF">
<div class="navheader">
<table width="100%" summary="Navigation header">
<tr><th colspan="3" align="center">Chapter 2. Parsers</th></tr>
<tr>
<td width="20%" align="left">
<a accesskey="p" href="ch01s03.html">Prev</a> </td>
<th width="60%" align="center"> </th>
<td width="20%" align="right"> <a accesskey="n" href="ch02s02.html">Next</a>
</td>
</tr>
</table>
<hr>
</div>
<div class="chapter">
<div class="titlepage"><div><div><h1 class="title">
<a name="chapter-parsers"></a>Chapter 2. Parsers</h1></div></div></div>
<div class="toc">
<p><b>Table of Contents</b></p>
<ul class="toc">
<li><span class="sect1"><a href="chapter-parsers.html#idp58339248">DOM Parser</a></span></li>
<li><span class="sect1"><a href="ch02s02.html">SAX Parser</a></span></li>
<li><span class="sect1"><a href="ch02s03.html">TextReader Parser</a></span></li>
</ul>
</div>
<p>Like the underlying libxml2 library, libxml++ provides three parsers to suit different needs: the DOM, SAX, and TextReader parsers. This chapter explains the relative advantages and behaviour of each.</p>
<p>All of the parsers may parse XML documents from a file on disk, a string, or a C++ std::istream. Although the libxml++ API uses only Glib::ustring, and therefore the UTF-8 encoding, libxml++ can parse documents in other encodings, converting to UTF-8 automatically. This conversion does not lose any information because UTF-8 can encode every Unicode character.</p>
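<p>To illustrate why the conversion to UTF-8 is lossless, here is a minimal sketch of how a single Unicode code point maps to one to four UTF-8 bytes. The function name <code class="literal">encode_utf8</code> is hypothetical and is not part of libxml++ or libxml2, which perform the conversion internally.</p>

```cpp
#include <string>

// Hypothetical helper, not part of libxml++: encodes one Unicode code point
// as UTF-8. Returns an empty string for values that are not valid Unicode
// scalar values (surrogates, or anything above U+10FFFF).
std::string encode_utf8(char32_t cp)
{
  std::string out;
  if (cp < 0x80)
  {
    // 1 byte: 0xxxxxxx
    out += static_cast<char>(cp);
  }
  else if (cp < 0x800)
  {
    // 2 bytes: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else if (cp < 0x10000)
  {
    if (cp >= 0xD800 && cp <= 0xDFFF)
      return out; // Surrogate halves are not characters.
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  else if (cp <= 0x10FFFF)
  {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

<p>Every code point up to U+10FFFF has exactly one such encoding, which is why text converted from another encoding can be represented in a Glib::ustring without loss.</p>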
<p>Remember that white space is usually significant in XML documents, so the parsers might provide unexpected text nodes that contain only spaces and new lines. The parser does not know whether you care about these text nodes, but your application may choose to ignore them.</p>
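<p>As a minimal sketch of the filtering an application might apply, the following helper decides whether a piece of text consists of white space only. The name <code class="literal">is_white_space_only</code> is hypothetical and the function works on a plain <code class="literal">std::string</code>; the DOM example in this chapter uses <code class="methodname">TextNode::is_white_space()</code> for the same purpose.</p>

```cpp
#include <string>

// Hypothetical helper, not part of libxml++: returns true when the text
// contains nothing but the XML white space characters (space, tab,
// carriage return, line feed). An empty string also counts as white space.
bool is_white_space_only(const std::string& text)
{
  return text.find_first_not_of(" \t\r\n") == std::string::npos;
}
```

<p>An application would typically skip a text node for which such a check succeeds, and process the node otherwise.</p>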
<div class="sect1">
<div class="titlepage"><div><div><h2 class="title" style="clear: both">
<a name="idp58339248"></a>DOM Parser</h2></div></div></div>
<p>The DOM (Document Object Model) parser parses the whole document at once and stores the structure in memory, available via <code class="methodname">DomParser::get_document()</code>. With methods such as <code class="methodname">Document::get_root_node()</code> and <code class="methodname">Node::get_children()</code>, you may then navigate the hierarchy of XML nodes without restriction, jumping forwards or backwards in the document based on the information that you encounter. Because the whole document is held in memory, the DOM parser uses a relatively large amount of memory.</p>
<p>You should use C++ RTTI (via <code class="literal">dynamic_cast&lt;&gt;</code>) to identify the specific node type and to perform actions which are not possible with all node types. For instance, only <code class="classname">Element</code>s have attributes. Here is the inheritance hierarchy of node types:</p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: disc; "><li class="listitem">
<p>xmlpp::Node
        </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: circle; ">
<li class="listitem">
<p>xmlpp::Attribute
          </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: square; ">
<li class="listitem"><p>xmlpp::AttributeDeclaration</p></li>
<li class="listitem"><p>xmlpp::AttributeNode</p></li>
</ul></div>
<p>
          </p>
</li>
<li class="listitem">
<p>xmlpp::ContentNode
          </p>
<div class="itemizedlist"><ul class="itemizedlist" style="list-style-type: square; ">
<li class="listitem"><p>xmlpp::CdataNode</p></li>
<li class="listitem"><p>xmlpp::CommentNode</p></li>
<li class="listitem"><p>xmlpp::EntityDeclaration</p></li>
<li class="listitem"><p>xmlpp::ProcessingInstructionNode</p></li>
<li class="listitem"><p>xmlpp::TextNode</p></li>
</ul></div>
<p>
          </p>
</li>
<li class="listitem"><p>xmlpp::Element</p></li>
<li class="listitem"><p>xmlpp::EntityReference</p></li>
<li class="listitem"><p>xmlpp::XIncludeEnd</p></li>
<li class="listitem"><p>xmlpp::XIncludeStart</p></li>
</ul></div>
<p>
        </p>
</li></ul></div>
<p>All <code class="classname">Node</code>s created by the DOM parser are leaves
      in the node type tree. For instance, the DOM parser can create
      <code class="classname">TextNode</code>s and <code class="classname">Element</code>s, but it
      does not create objects whose exact type is <code class="classname">ContentNode</code>
      or <code class="classname">Node</code>.
    </p>
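<p>The <code class="literal">dynamic_cast&lt;&gt;</code> pattern described above can be sketched with a few stand-in classes. The classes in the <code class="literal">sketch</code> namespace are assumptions that mirror only a small part of the hierarchy; they are not the real <code class="classname">xmlpp</code> classes, and <code class="literal">describe()</code> is a hypothetical function.</p>

```cpp
#include <string>

// Stand-in classes mirroring part of the hierarchy listed above.
// They are NOT the real xmlpp classes; only the casting pattern matters.
namespace sketch
{
struct Node { virtual ~Node() = default; };
struct ContentNode : Node { std::string content; };
struct TextNode : ContentNode {};
struct CommentNode : ContentNode {};
struct Element : Node { std::string name; };

// Returns a short label for the most specific type that is recognised.
std::string describe(const Node* node)
{
  // Test the leaf types first: a TextNode is also a ContentNode,
  // so the base-class check has to come after the derived-class checks.
  if (dynamic_cast<const TextNode*>(node))
    return "text";
  if (dynamic_cast<const CommentNode*>(node))
    return "comment";
  if (dynamic_cast<const ContentNode*>(node))
    return "content";
  if (dynamic_cast<const Element*>(node))
    return "element";
  return "other"; // Also reached for a null pointer.
}
} // namespace sketch
```

<p>The order of the checks matters because a cast to a base class succeeds for every derived object, so testing <code class="classname">ContentNode</code> first would hide the more specific types.</p>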
<p>Although you may obtain pointers to the <code class="classname">Node</code>s, these <code class="classname">Node</code>s are always owned by their parent <code class="classname">Node</code>. In most cases that means that the <code class="classname">Node</code> will exist, and your pointer will be valid, as long as the <code class="classname">Document</code> instance exists.</p>
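<p>The ownership rule can be modelled with a small stand-in tree. This is only a sketch under the assumption that each parent owns its children through <code class="literal">std::unique_ptr</code>; libxml++ manages its nodes differently internally, and the class and method names in the <code class="literal">ownership_sketch</code> namespace are hypothetical.</p>

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in tree, NOT the real xmlpp::Node: every parent owns its children,
// and callers only ever receive non-owning raw pointers.
namespace ownership_sketch
{
class Node
{
public:
  explicit Node(std::string name) : name_(std::move(name)) {}

  // Creates a child owned by this node and returns a non-owning pointer.
  // The pointer stays valid until this node (the owner) is destroyed.
  Node* add_child(std::string name)
  {
    children_.push_back(std::unique_ptr<Node>(new Node(std::move(name))));
    return children_.back().get();
  }

  const std::string& get_name() const { return name_; }
  std::size_t child_count() const { return children_.size(); }

private:
  std::string name_;
  std::vector<std::unique_ptr<Node>> children_;
};
} // namespace ownership_sketch
```

<p>In this model the caller never deletes a child pointer. Destroying the root releases the whole tree, which corresponds to the nodes becoming invalid once the <code class="classname">Document</code> is gone.</p>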
<p>There are also several methods which can create new child <code class="classname">Node</code>s. By using these, and one of the <code class="methodname">Document::write_*()</code> methods, you can use libxml++ to build a new XML document.</p>
<div class="sect2">
<div class="titlepage"><div><div><h3 class="title">
<a name="idp58358688"></a>Example</h3></div></div></div>
<p>This example walks the whole document tree and prints each node's name, content, and attributes. The examples shown in this documentation are included in the libxml++ source distribution.</p>
<p><a class="ulink" href="http://git.gnome.org/browse/libxml++/tree/examples/dom_parser" target="_top">Source Code</a></p>
<p>File: main.cc
</p>
<pre class="programlisting">
#ifdef HAVE_CONFIG_H
#include &lt;config.h&gt;
#endif

#include "../testutilities.h"
#include &lt;libxml++/libxml++.h&gt;
#include &lt;iostream&gt;
#include &lt;cstdlib&gt;
#include &lt;locale&gt;
#include &lt;string&gt;

void print_node(const xmlpp::Node* node, unsigned int indentation = 0)
{
  const Glib::ustring indent(indentation, ' ');
  std::cout &lt;&lt; std::endl; //Separate nodes by an empty line.
  
  const auto nodeContent = dynamic_cast&lt;const xmlpp::ContentNode*&gt;(node);
  const auto nodeText = dynamic_cast&lt;const xmlpp::TextNode*&gt;(node);
  const auto nodeComment = dynamic_cast&lt;const xmlpp::CommentNode*&gt;(node);

  if(nodeText &amp;&amp; nodeText-&gt;is_white_space()) //Let's ignore the indenting - you don't always want to do this.
    return;
    
  const auto nodename = node-&gt;get_name();

  if(!nodeText &amp;&amp; !nodeComment &amp;&amp; !nodename.empty()) //Let's not say "name: text".
  {
    const auto namespace_prefix = node-&gt;get_namespace_prefix();

    std::cout &lt;&lt; indent &lt;&lt; "Node name = ";
    if(!namespace_prefix.empty())
      std::cout &lt;&lt; CatchConvertError(namespace_prefix) &lt;&lt; ":";
    std::cout &lt;&lt; CatchConvertError(nodename) &lt;&lt; std::endl;
  }
  else if(nodeText) //Let's say when it's text. - e.g. let's say what that white space is.
  {
    std::cout &lt;&lt; indent &lt;&lt; "Text Node" &lt;&lt; std::endl;
  }

  //Treat the various node types differently: 
  if(nodeText)
  {
    std::cout &lt;&lt; indent &lt;&lt; "text = \"" &lt;&lt; CatchConvertError(nodeText-&gt;get_content()) &lt;&lt; "\"" &lt;&lt; std::endl;
  }
  else if(nodeComment)
  {
    std::cout &lt;&lt; indent &lt;&lt; "comment = " &lt;&lt; CatchConvertError(nodeComment-&gt;get_content()) &lt;&lt; std::endl;
  }
  else if(nodeContent)
  {
    std::cout &lt;&lt; indent &lt;&lt; "content = " &lt;&lt; CatchConvertError(nodeContent-&gt;get_content()) &lt;&lt; std::endl;
  }
  else if(const xmlpp::Element* nodeElement = dynamic_cast&lt;const xmlpp::Element*&gt;(node))
  {
    //A normal Element node:

    //get_line() works only for Element nodes.
    std::cout &lt;&lt; indent &lt;&lt; "     line = " &lt;&lt; node-&gt;get_line() &lt;&lt; std::endl;

    //Print attributes:
    const auto attributes = nodeElement-&gt;get_attributes();
    for(const auto&amp; attribute : attributes)
    {
      const auto namespace_prefix = attribute-&gt;get_namespace_prefix();

      std::cout &lt;&lt; indent &lt;&lt; "  Attribute ";
      if(!namespace_prefix.empty())
        std::cout &lt;&lt; CatchConvertError(namespace_prefix) &lt;&lt; ":";
      std::cout &lt;&lt; CatchConvertError(attribute-&gt;get_name()) &lt;&lt; " = "
                &lt;&lt; CatchConvertError(attribute-&gt;get_value()) &lt;&lt; std::endl;
    }

    const auto attribute = nodeElement-&gt;get_attribute("title");
    if(attribute)
    {
      std::cout &lt;&lt; indent;
      if (dynamic_cast&lt;const xmlpp::AttributeNode*&gt;(attribute))
        std::cout &lt;&lt; "AttributeNode ";
      else if (dynamic_cast&lt;const xmlpp::AttributeDeclaration*&gt;(attribute))
        std::cout &lt;&lt; "AttributeDeclaration ";
      std::cout &lt;&lt; "title = " &lt;&lt; CatchConvertError(attribute-&gt;get_value()) &lt;&lt; std::endl;
    }
  }
  
  if(!nodeContent)
  {
    //Recurse through child nodes:
    for(const auto&amp; child : node-&gt;get_children())
    {
      print_node(child, indentation + 2); //recursive
    }
  }
}

int main(int argc, char* argv[])
{
  // Set the global C++ locale to the user-specified locale. Then we can
  // hopefully use std::cout with UTF-8, via Glib::ustring, without exceptions.
  std::locale::global(std::locale(""));

  bool validate = false;
  bool set_throw_messages = false;
  bool throw_messages = false;
  bool substitute_entities = true;
  bool include_default_attributes = false;

  int argi = 1;
  while (argc &gt; argi &amp;&amp; *argv[argi] == '-') // option
  {
    switch (*(argv[argi]+1))
    {
      case 'v':
        validate = true;
        break;
      case 't':
        set_throw_messages = true;
        throw_messages = true;
        break;
      case 'e':
        set_throw_messages = true;
        throw_messages = false;
        break;
      case 'E':
        substitute_entities = false;
        break;
      case 'a':
        include_default_attributes = true;
        break;
      default:
        std::cout &lt;&lt; "Usage: " &lt;&lt; argv[0] &lt;&lt; " [-v] [-t] [-e] [-E] [-a] [filename]" &lt;&lt; std::endl
                  &lt;&lt; "       -v  Validate" &lt;&lt; std::endl
                  &lt;&lt; "       -t  Throw messages in an exception" &lt;&lt; std::endl
                  &lt;&lt; "       -e  Write messages to stderr" &lt;&lt; std::endl
                  &lt;&lt; "       -E  Do not substitute entities" &lt;&lt; std::endl
                  &lt;&lt; "       -a  Include default attributes in the node tree" &lt;&lt; std::endl;
        return EXIT_FAILURE;
    }
    argi++;
  }
  std::string filepath;
  if(argc &gt; argi)
    filepath = argv[argi]; //Allow the user to specify a different XML file to parse.
  else
    filepath = "example.xml";
 
  try
  {
    xmlpp::DomParser parser;
    if (validate)
      parser.set_validate();
    if (set_throw_messages)
      parser.set_throw_messages(throw_messages);
    //We can have the text resolved/unescaped automatically.
    parser.set_substitute_entities(substitute_entities);
    parser.set_include_default_attributes(include_default_attributes);
    parser.parse_file(filepath);
    if(parser)
    {
      //Walk the tree:
      const auto pNode = parser.get_document()-&gt;get_root_node(); //deleted by DomParser.
      print_node(pNode);
    }
  }
  catch(const std::exception&amp; ex)
  {
    std::cerr &lt;&lt; "Exception caught: " &lt;&lt; ex.what() &lt;&lt; std::endl;
    return EXIT_FAILURE;
  }

  return EXIT_SUCCESS;
}

</pre>
</div>
</div>
</div>
<div class="navfooter">
<hr>
<table width="100%" summary="Navigation footer">
<tr>
<td width="40%" align="left">
<a accesskey="p" href="ch01s03.html">Prev</a> </td>
<td width="20%" align="center"> </td>
<td width="40%" align="right"> <a accesskey="n" href="ch02s02.html">Next</a>
</td>
</tr>
<tr>
<td width="40%" align="left" valign="top">Compilation and Linking </td>
<td width="20%" align="center"><a accesskey="h" href="index.html">Home</a></td>
<td width="40%" align="right" valign="top"> SAX Parser</td>
</tr>
</table>
</div>
</body>
</html>
